Systems and Methods for Medical Topic Discovery Based on Large-Scale Machine Learning

ABSTRACT

A machine learning system including a plurality of machine learning processors maintains a topic matrix that represents the relevancies or measures of prevalence of a plurality of medical topics among a plurality of clinical documents. Each processor in the system is configured to determine at least one local sufficient factor group for a document included in the plurality of documents, and to send the at least one local sufficient factor group to one or more other processors in the system. Each processor is further configured to receive at least one remote sufficient factor group from another processor in the system, and to process the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix. The remote sufficient factor group or groups are determined by other processors in the system for another document included in the plurality of documents.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims the benefit of and priority to 1) U.S. Provisional Patent Application Ser. No. 62/699,385, filed Jul. 17, 2018, for "Diversity-Promoting and Large-Scale Machine Learning for Healthcare", and 2) U.S. Provisional Patent Application Ser. No. 62/756,024, filed Nov. 5, 2018, for "Diversity-Promoting and Large-Scale Machine Learning for Healthcare", the entire disclosures of which are incorporated herein by reference.

This application has subject matter in common with: 1) U.S. patent application Ser. No. 16/038,895, filed Jul. 18, 2018, for "A Machine Learning System for Measuring Patient Similarity", 2) U.S. patent application Ser. No. 15/946,482, filed Apr. 5, 2018, for "A Machine Learning System for Disease, Patient, and Drug Co-Embedding, and Multi-Drug Recommendation", 3) U.S. Patent Application Ser. No. _____, filed _____, for "Systems and Methods for Predicting Medications to Prescribe to a Patient Based on Machine Learning", 4) U.S. Patent Application Ser. No. _____, filed _____, for "Systems and Methods for Automatically Tagging Concepts to, and Generating Text Reports for, Medical Images Based on Machine Learning", and 5) U.S. Patent Application Ser. No. _____, filed _____, for "Systems and Methods for Automatically Generating International Classification of Disease Codes for a Patient Based on Machine Learning", the entire disclosures of which are incorporated herein by reference.

TECHNICAL FIELD

The present disclosure generally relates to machine learning for healthcare, and more particularly, to large-scale machine learning systems and methods that process clinical documents to derive informative matrices that represent the relevancies or measures of prevalence of a plurality of medical topics among a plurality of clinical documents.

BACKGROUND

With the widespread adoption of electronic health records (EHR) systems, and the rapid development of new technologies such as high-throughput medical imaging devices, low-cost genome profiling systems, networked and even wearable sensors, mobile applications, and the rich accumulation of medical knowledge/discoveries in databases, a tsunami of medical and healthcare data has emerged. It was estimated that 153 exabytes (one exabyte equals one billion gigabytes) of healthcare data were produced in 2013. In 2020, an estimated 2,314 exabytes will be produced. From 2013 to 2020, this represents an overall rate of increase of at least 48 percent annually.

In addition to the sheer volume, the complexity of healthcare data is also overwhelming. Such data includes clinical notes, medical images, lab values, vital signs, etc., coming from multiple heterogeneous modalities including texts, images, tabular data, time series, graphs, and so on. The rich clinical data is becoming an increasingly important source of holistic and detailed information for both healthcare providers and receivers. Collectively analyzing and digesting this rich information generated from multiple sources; uncovering the health implications, risk factors, and mechanisms underlying the heterogeneous and noisy data records at both individual-patient and whole-population levels; and making clinical decisions thereupon, including diagnosis, triage, and treatment, are now routine activities expected to be conducted by medical professionals including physicians, nurses, pharmacists, and so on.

As the amount and complexity of medical data rapidly grow, these activities are becoming increasingly difficult for human experts. The information overload makes medical analytics and decision-making time-consuming, error-prone, suboptimal, and less transparent. As a result, physicians, patients, and hospitals suffer a number of pain points, quality-wise and efficiency-wise. For example, in terms of quality, 250,000 Americans die each year from medical errors, which have become the third leading cause of death in the United States. Twelve million Americans are misdiagnosed each year. Preventable medication errors impact more than 7 million patients and cost almost $21 billion annually. Fifteen to twenty-five percent of patients are readmitted within 30 days, and readmissions are costly (e.g., $41.3 billion in 2011). In terms of inefficiency, patients wait on average 6 hours in emergency rooms. Nearly 400,000 patients wait 24 hours or more. Physicians spend only 27 percent of their office day on direct clinical face time with patients. The U.S. healthcare system wastes $750 billion annually due to unnecessary services, inefficient care delivery, excess administrative costs, etc.

The advancement of machine learning (ML) technology opens up opportunities for next-generation computer-aided medical data analysis and data-driven clinical decision making, where machine learning algorithms and systems can be developed to automatically and collectively digest massive medical data such as electronic health records, images, behavioral data, and the genome, to make data-driven and intelligent diagnostic predictions. A machine learning system can automatically analyze multiple sources of information with rich structure, uncover the medically meaningful hidden concepts from low-level records to aid medical professionals to easily and concisely understand the medical data, and create a compact set of informative diagnostic procedures and treatment courses and make healthcare recommendations thereupon.

It is therefore desirable to leverage the power of machine learning in automatically distilling insights from large-scale heterogeneous data for automatic smart data-driven medical predictions, recommendations, and decision-making, to assist physicians and hospitals in improving the quality and efficiency of healthcare. It is further desirable to have machine learning algorithms and systems that turn the raw clinical data into actionable insights for clinical applications. One such clinical application relates to discovering medical topics from large-scale texts.

When applying machine learning to healthcare applications, several fundamental issues may arise, including:

1) How to better capture infrequent patterns: At the core of ML-based healthcare is discovering the latent patterns (e.g., topics in clinical notes, disease subtypes, phenotypes) underlying the observed clinical data. Under many circumstances, the frequency of patterns is highly imbalanced. Some patterns have very high frequency while others occur less frequently. Existing ML models lack the capability of capturing infrequent patterns. Known convolutional neural networks, for example, do not perform well on infrequent patterns. Such a deficiency of existing models possibly results from the design of the objective function used for training. For example, a maximum likelihood estimator would reward itself by modeling the frequent patterns well, as they are the major contributors to the likelihood function. On the other hand, infrequent patterns contribute much less to the likelihood, so it is not very rewarding to model them well and they tend to be ignored. Infrequent patterns are of crucial importance in clinical settings. For example, many infrequent diseases are life-threatening, and it is critical to capture them.

2) How to alleviate overfitting: In certain clinical applications, the number of medical records available for training is limited. For example, when training a diagnostic model for an infrequent disease, typically there is no access to a sufficiently large number of patient cases due to the rareness of this disease. Under such circumstances, overfitting easily happens, wherein the trained model works well on the training data but generalizes poorly on unseen patients. It is critical to alleviate overfitting.

3) How to improve interpretability: Being interpretable and transparent is a must for an ML model to be willingly used by human physicians. Oftentimes, the patterns extracted by existing ML methods have a lot of redundancy and overlap, which makes them ambiguous and difficult to interpret. For example, in computational phenotyping from EHRs, it is observed that the phenotypes learned by standard matrix and tensor factorization algorithms have much overlap, causing confusion such as two similar treatment plans being learned for the same type of disease. It is necessary to make the learned patterns distinct and interpretable.

4) How to compress model size without sacrificing modeling power: In clinical practice, making a timely decision is crucial for improving patient outcomes. To achieve time efficiency, the size (specifically, the number of weight parameters) of ML models needs to be kept small. However, reducing the model size, which accordingly reduces the capacity and expressivity of the model, typically sacrifices modeling power and performance. It is technically appealing but challenging to compress model size without losing performance.

5) How to efficiently learn large-scale models: In certain healthcare applications, both the model size and the data size are large, incurring substantial computation overhead that exceeds the capacity of a single machine. It is necessary to design and build distributed systems to efficiently train such models.

Discovering medical topics from clinical documents has many applications, such as consumer medical search, mining FDA drug labels, and investigating drug repositioning opportunities, to name a few. In practice, the clinical text corpus can contain millions of documents, and the medical dictionary is comprised of hundreds of thousands of terminologies. These large-scale documents contain rich medical topics, whose number can be in the tens of thousands. Efficiently discovering so many topics from such a large dataset is computationally challenging.

SUMMARY

In one aspect of the disclosure, a machine learning system including a plurality of machine learning processors maintains a topic matrix that represents the relevancies or measures of prevalence of a plurality of medical topics among a plurality of clinical documents. Each processor in the system is configured to determine at least one local sufficient factor group for a document included in the plurality of documents, and to send the at least one local sufficient factor group to one or more other processors in the system. Each processor is further configured to receive at least one remote sufficient factor group from another processor in the system. The remote sufficient factor group or groups are determined by other processors in the system for another document included in the plurality of documents. Each processor processes its local sufficient factor group together with the remote sufficient factor group or groups it receives to obtain the topic matrix.

In another aspect of the disclosure, a method of creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents includes determining at least one local sufficient factor group for one or more documents included in the plurality of clinical documents using a first processor included in a machine learning system comprising a plurality of machine learning processors. The method further includes sending the at least one local sufficient factor group from the first processor to one or more second processors in the system. The method also includes receiving, at the first processor, at least one remote sufficient factor group from a second processor in the system, and processing the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix at the first processor. The at least one remote sufficient factor group is determined by the second processor for another document included in the plurality of clinical documents.

It is understood that other aspects of methods and systems will become readily apparent to those skilled in the art from the following detailed description, wherein various aspects are shown and described by way of illustration.

BRIEF DESCRIPTION OF THE DRAWINGS

Various aspects of apparatuses and methods will now be presented in the detailed description by way of example, and not by way of limitation, with reference to the accompanying drawings, wherein:

FIG. 1 is a block diagram of a medical topic discovery system implemented by a large-scale peer-to-peer distributed machine learning system including a plurality of processors.

FIG. 2 is a block diagram of a sufficient factor broadcasting model adopted by the peer-to-peer distributed machine learning system of FIG. 1.

FIG. 3 is a block diagram of a random multicast model adopted by the peer-to-peer distributed machine learning system of FIG. 1.

FIG. 4 is an illustration of an algorithm, referred to as Algorithm 3, that implements a sufficient factor selection process used by the peer-to-peer distributed machine learning system of FIG. 1.

FIG. 5 is an illustration of an algorithm, referred to as Algorithm 4, that implements a sufficient factor group transformation process used by the peer-to-peer distributed machine learning system of FIG. 1.

FIG. 6 is an expression graph representing the parsing of an expression related to a sufficient factor identification process used by the peer-to-peer distributed machine learning system of FIG. 1.

FIG. 7 is a block diagram of a software stack included in the processors of the peer-to-peer distributed machine learning system of FIG. 1.

FIG. 8 is a flowchart of a method of creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents.

FIG. 9 is a block diagram of an apparatus, e.g., machine, processor, or worker, included in the large-scale peer-to-peer distributed machine learning system of FIG. 1 that implements the method of FIG. 8.

DETAILED DESCRIPTION

With reference to FIG. 1, a medical topic discovery system 100 configured in accordance with the large-scale distributed machine learning system disclosed herein includes a plurality of machine learning processors 102, 104, 106. The processors 102, 104, 106, also referred to herein as machines or workers, perform individual machine learning tasks and share the results of these tasks with other processors in the system 100 in order to maintain a shared topic matrix 110 that represents the relevancies or measures of prevalence of different medical topics addressed in a corpus of clinical documents 112. For ease of illustration, only three machine learning processors 102, 104, 106 are shown in FIG. 1. The system 100 may, however, include many additional processors. Some of the concepts and features described herein are included in Diversity-Promoting and Large-Scale Machine Learning for Healthcare, a thesis submitted by Pengtao Xie in August 2018 to the Machine Learning Department, School of Computer Science, Carnegie Mellon University, which is hereby incorporated by reference in its entirety.

Each processor 102, 104, 106 in the system 100 is configured to determine at least one local sufficient factor group (LSFG) for a document included in the corpus of clinical documents 112, and to send the local sufficient factor group to one or more other processors in the system. For example, as shown in FIG. 1, a first processor 102 may determine and send a local sufficient factor group LSFG1 to the other processors 104, 106 in the system. The local sufficient factor group includes two sufficient factors, each corresponding to a vector representing a measure, e.g., association strength, between words in the clinical document and a medical topic addressed in the clinical document. These sufficient factor vectors may be obtained, for example, through the stochastic optimization procedures described below.

Each processor 102, 104, 106 in the system 100 is further configured to receive at least one remote sufficient factor group (RSFG) from another processor in the system, and to process its local sufficient factor group together with the received remote sufficient factor group to obtain the topic matrix 110. Each remote sufficient factor group is determined by other processors in the system for another document included in the corpus of clinical documents 112.

Continuing with the example shown in FIG. 1, the first processor 102 may receive a remote sufficient factor group RSFG2 from the second processor 104 in the system 100, and a remote sufficient factor group RSFGn from the nth processor 106 in the system. Like the local sufficient factor group determined by the first processor 102, each remote sufficient factor group determined by the other processors 104, 106 in the system includes two sufficient factors, each corresponding to a vector representing a measure, e.g., association strength, between words in another clinical document and a medical topic addressed in this other clinical document.

It is noted that the "local" and "remote" nomenclature used in describing the system 100 is relative to each individual processor 102, 104, 106. More specifically, a sufficient factor group determined by a processor is: 1) a "local" sufficient factor group for that processor and 2) a "remote" sufficient factor group for all other processors in the system.

Continuing with FIG. 1, in some configurations, an individual processor 102, 104, 106 may determine more than one local sufficient factor group. For example, a processor may determine a plurality of local sufficient factor groups for a corresponding plurality of clinical documents included in the corpus of clinical documents 112. In such cases, where a processor has determined multiple sufficient factor groups, the processor may be further configured to select and send a subset of the plurality of sufficient factor groups to the one or more other processors in the system.

In some configurations, instead of being sent to all other processors in the system 100, a local sufficient factor group may be sent by a processor 102, 104, 106 to a select subset of other processors in the system. To this end, the processors 102, 104, 106 are further configured to randomly select, from among a plurality of other processors in the system, the subset of other processors to which to send the local sufficient factor group.

Upon receipt of one or more remote sufficient factor groups, a processor 102 processes its local sufficient factor group together with the remote sufficient factor group or groups it received from the other processors 104, 106 in the system 100. Accordingly, the processor 102 is configured to convert each of the local sufficient factor group and the remote sufficient factor group or groups into a corresponding update matrix, and to apply each update matrix to the topic matrix using a projection operation. In one configuration, the processor 102 converts each of the local sufficient factor group and the remote sufficient factor group or groups into a corresponding update matrix by obtaining an outer product of the sufficient factors that respectively define the local sufficient factor group and the remote sufficient factor group. The outcome of this process is an updated or present state of the topic matrix 110.
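By way of non-limiting illustration, the following Python sketch shows how such a conversion and projection step might look. The function name, the learning rate, and the use of a non-negativity projection are illustrative assumptions rather than features recited by this disclosure.

    import numpy as np

    def apply_sfgs(topic_matrix, sfgs, lr=0.1):
        # Each sufficient factor group is a pair of vectors (u, v); the
        # outer product u v^T reconstructs the corresponding update matrix.
        for u, v in sfgs:
            topic_matrix -= lr * np.outer(u, v)
        # Illustrative projection: clip entries to be non-negative so they
        # remain interpretable as measures of topic prevalence.
        np.maximum(topic_matrix, 0.0, out=topic_matrix)
        return topic_matrix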

The medical topic discovery system 100 thus described may efficiently process a large corpus of clinical documents 112 by dividing the initial task of document processing among the processors in the system and sharing the results of the document processing, e.g., the sufficient factor groups, across the system. The sharing of results enables the creation of a topic matrix through an iterative approach, where an initial topic matrix derived from an initial set of documents is updated as additional sets of documents are processed by the system 100.

Having thus described the general configuration and operation of a medical topic discovery system 100, following is a description of a large-scale distributed learning architecture that may be used to implement the medical topic discovery system.

Large-Scale Distributed Learning

With continued reference to FIG. 1, the medical topic discovery system 100 may be implemented in the form of a large-scale peer-to-peer distributed machine learning system. The system 100 relies on machine-learned models that are parameterized by matrices having the sufficient factor (SF) property. The system significantly reduces communication and computation costs. For efficient communication, the system uses: 1) sufficient factor broadcasting to transfer small-sized vectors among machines for the synchronization of matrix-form parameters, 2) random multicast, where each machine randomly selects a subset of machines to communicate with in each clock, and 3) sufficient factor selection, which selects a subset of the most representative sufficient factors to communicate. These characteristics of the system greatly reduce the number of network messages and the size of each message. For efficient computation, the system uses: 1) sufficient factors to represent parameter matrices and 2) a sufficient-factor-aware approach to matrix-vector multiplication, which reduces the cost from quadratic in the matrix dimensions down to linear.

Sufficient Factor Property

The system invokes a mathematical property of a large family of machine learning models that admits the following optimization formulation:

$(P)\quad \min_{W}\;\frac{1}{N}\sum_{i=1}^{N}f_{i}(Wa_{i})+h(W)\quad(\text{Eq. 1})$

The model is parametrized by a matrix W ∈ ℝ^(J×D). The loss function f_i(·) is typically defined over a set of training samples {(a_i, b_i)}_{i=1}^N, with the dependence on b_i being suppressed. f_i(·) is allowed to be either convex or nonconvex, smooth or nonsmooth (with subgradient everywhere). Examples include the ℓ₂ loss and the multiclass logistic loss, amongst others. The regularizer h(W) is assumed to admit an efficient proximal operator prox_h(·). For example, h(·) could be an indicator function of convex constraints, or the ℓ₁-, ℓ₂-, or trace-norm, to name a few. The vectors a_i and b_i can represent observed features, supervised information (e.g., class labels in classification, response values in regression), or even unobserved auxiliary information (such as sparse codes in sparse coding) associated with data sample i. The key property exploited below arises from the matrix-vector multiplication Wa_i. This optimization problem (P) can be used to represent a rich set of machine learning models, such as sparse coding.

Sparse coding learns a dictionary of basis vectors from data, so that the data can be re-represented sparsely (and thus efficiently) in terms of the dictionary. In sparse coding, W is the dictionary matrix, a_i are the sparse codes, b_i is the input feature vector, and f_i(·) is a quadratic function. To prevent the entries in W from becoming too large, each column W_k must satisfy ∥W_k∥₂ ≤ 1. In this case, h(W) is an indicator function which equals 0 if W satisfies the constraints and equals ∞ otherwise. See, e.g., Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 1997, the disclosure of which is incorporated by reference.

To solve the optimization problem (P), it is common to employ either proximal stochastic gradient descent (SGD) or stochastic dual coordinate ascent (SDCA), both of which are popular and well-established parallel optimization techniques. See, e.g., Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: building an efficient and scalable deep learning training system. In USENIX Symposium on Operating Systems Design and Implementation, 2014, and Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In International Conference on Machine Learning, 2008, the disclosures of which are incorporated by reference.

Proximal SGD: In proximal SGD, a stochastic estimate of the gradient, ΔW, is first computed over one data sample (or a mini-batch of samples), in order to update W via W ← W − ηΔW (where η is the learning rate). Following this, the proximal operator prox_ηh(·) is applied to W. Notably, the stochastic gradient ΔW in (P) can be written as the outer product of two vectors, ΔW = uv^T, where u = ∂f(Wa_i, b_i)/∂(Wa_i) and v = a_i, according to the chain rule. Later below, it is shown that this low-rank structure of ΔW can reduce the amount of communication among machines in a large-scale system.

Stochastic DCA: SDCA applies to problems (P) where f_i(·) is convex and h(·) is strongly convex (e.g., when h(·) contains the squared ℓ₂ norm). SDCA solves the dual problem of (P) via stochastic coordinate ascent on the dual variables. Introducing the dual matrix U = [u₁, . . . , u_N] ∈ ℝ^(J×N) and the data matrix A = [a₁, . . . , a_N] ∈ ℝ^(D×N), the dual problem of (P) can be written as:

$(D)\quad \min_{U}\;\frac{1}{N}\sum_{i=1}^{N}f_{i}^{*}(-u_{i})+h^{*}\!\left(\frac{1}{N}UA^{T}\right)\quad(\text{Eq. 2})$

where f_i^*(·) and h^*(·) are the Fenchel conjugate functions of f_i(·) and h(·), respectively.

The primal-dual matrices W and U are connected by W = ∇h*(Z), where the auxiliary matrix

$Z:=\frac{1}{N}UA^{T}.$

Algorithmically, the dual matrix U, the primal matrix W, and the auxiliary matrix Z are updated. In every iteration, a random data sample i is picked and the stochastic update Δu_i is computed by minimizing (D) while holding {u_j}_{j≠i} fixed. The dual variable is updated via u_i ← u_i − Δu_i, the auxiliary variable via Z ← Z − Δu_i a_i^T, and the primal variable via W = ∇h*(Z). Similar to stochastic gradient descent, the update of Z is also the outer product of two vectors, Δu_i and a_i, which can be exploited to reduce communication cost.

Sufficient Factor Property in SGD and SDCA: In both SGD and SDCA, theparameter matrix update can be computed as the outer product of twovectors, which vectors are referred to herein as sufficient factors. Theset of sufficient factors that are generated with respect to one dataexample and that atomically produce a parameter update is referred to asa sufficient factor group (SFG). This property can be leveraged toimprove the communication efficiency of distributed machine learningsystems. Instead of communicating updated parameter matrices amongmachines, the sufficient factors are communicated among the machines inthe form of a sufficient factor group and the update matrices arereconstructed locally at each machine. Because the sufficient factorsare much smaller in size, synchronization costs can be dramaticallyreduced.
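As a minimal sketch of this property, consider multiclass logistic regression (the example used later in this disclosure): the stochastic gradient for one sample decomposes into the outer product of two vectors, so only the pair (u, v) needs to be communicated. The helper names below are assumptions for illustration.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def compute_sfg(W, a_i, b_i):
        # Returns the sufficient factor group (u, v), where grad = u v^T.
        u = softmax(W @ a_i) - b_i   # derivative of the loss w.r.t. (W a_i);
                                     # b_i is assumed to be a one-hot label
        v = a_i                      # the feature vector
        return u, v

    # Receiver side: reconstruct the update matrix locally.
    # delta_W = np.outer(u, v)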

Peer-to-Peer Communication Based on Sufficient Factors

As just mentioned, the sufficient factor property may be leveraged to reduce communication cost. To ensure consistency among different copies or replicas of the parameter matrix, the parameter matrix updates computed at different machines need to be exchanged. One popular system architecture that enables this is the parameter server (PS), which consists of a server machine that maintains a shared state of the parameter matrix and a set of worker machines each having a local cache of the parameter matrix. In a parameter server, the parameter updates computed at the worker machines are aggregated at the server and applied to the shared state of the parameter matrix that is maintained at the server. The server subsequently sends the shared state of the parameter matrix to the worker machines, and the worker machines refresh their local caches of the parameter matrix to match the shared state maintained at the server. When a parameter server is used to train matrix-parametrized models, updated parameter matrices, which could contain billions of elements, are transferred, incurring substantial communication overhead.

A large-scale peer-to-peer (P2P) distributed machine learning system may be used in place of the parameter server framework, where the system design is driven by the sufficient factor property. The large-scale peer-to-peer distributed learning architecture is a decentralized system that executes data-parallel distributed training of matrix-parameterized machine learning models. The decentralized system runs on a group of worker machines connected via a peer-to-peer network. Unlike client-server architectures, including the parameter server, machines in the decentralized system play equal roles without server/client asymmetry, and every pair of machines can communicate. Each machine holds one shard of the data and a replica of the model parameters. Machines synchronize their model replicas to ensure consistency by exchanging parameter (pre-)updates via network communication. Under this general framework, the decentralized system applies a battery of system-algorithm co-designs to achieve efficiency in communication and fault tolerance.

For efficient communication, a feature of the decentralized system is to represent the parameter update matrices by their corresponding sufficient factors, which can be understood as "pre-updates", meaning that the actual update matrices must be computed on each machine upon receiving fresh sufficient factors, and the update matrices themselves are never transmitted. Since the size of the sufficient factors is much smaller than that of the matrices, the communication cost can be substantially reduced. Under a peer-to-peer architecture, in addition to avoiding transmitting update matrices, the decentralized system can also avoid transmitting parameter matrices, while still achieving synchrony. Besides, random multicast, under which each machine sends sufficient factors to a randomly-chosen subset of machines, is leveraged to reduce the number of messages. Sufficient factor selection, which chooses a subset of representative sufficient factors to communicate, is used to further reduce the size of each message.

The decentralized system uses an incremental sufficient factor checkpoint for fault tolerance, motivated by the fact that the parameter states can be represented as a dynamically growing set of sufficient factors. Machines continuously save the new sufficient factors computed in each logical time onto stable storage. To recover a parameter state, the decentralized system transforms the saved sufficient factors into a matrix. Compared with checkpointing parameter matrices, saving vectors requires much less disk input/output and does not require the application program to halt. Besides, the parameters can be rolled back to the state at any logical time.

In the programming abstraction, the sufficient factors are explicitly exposed such that system-level optimizations based on sufficient factors can be exploited. The decentralized system is able to automatically identify the symbolic expressions representing sufficient factors and updates, relieving users of the burden to manually specify them.

The decentralized system supports two consistency models: bulk synchronous parallel (BSP) and staleness synchronous parallel (SSP). Bulk synchronous parallel sets a global barrier at each clock. A worker cannot proceed to the next clock until all workers reach this barrier. Staleness synchronous parallel allows workers to have different paces as long as their difference in clock is no more than a user-defined staleness threshold.
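The following is a simplified sketch, offered as an assumption of how such consistency checks could be expressed, not as the system's actual scheduler:

    def can_proceed(my_clock, all_clocks, staleness=0):
        # BSP (staleness == 0): a worker may advance only when every worker
        # has reached the barrier; SSP: it may run ahead of the slowest
        # worker by at most `staleness` clocks.
        slowest = min(all_clocks)
        return my_clock - slowest <= staleness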

Sufficient Factor Broadcasting

With reference to FIG. 2, leveraging the sufficient factor property of the update matrix in problems (P) and (D), a sufficient factor broadcasting (SFB) computation model that supports efficient (low-communication) distributed learning of the parameter matrix W is used. Consider a setting with P workers, each of which holds a data shard and a copy of the parameter matrix W. Stochastic updates to W are generated via proximal SGD or SDCA, and communicated between machines to ensure parameter consistency.

In proximal SGD, on every iteration, each worker p computes sufficient factors (u_p, v_p), based on one data sample x_i = (a_i, b_i) in the worker's data shard. The worker then broadcasts (u_p, v_p) to all other workers. Once all P workers have performed their broadcast (and have thus received all sufficient factors), they reconstruct the P update matrices (one per data sample) from the P sufficient factors, and apply them to update their local copy of W. Finally, each worker applies the proximal operator prox_h(·). When using SDCA, the above procedure is instead used to broadcast sufficient factors for the auxiliary matrix Z, which is then used to obtain the primal matrix W = ∇h*(Z).

In FIG. 2, the SFB operation is performed by four workers. The workers compute their respective sufficient factors (u₁, v₁), . . . , (u₄, v₄), which are then broadcast to the other three workers. Each worker p uses all four sufficient factor pairs (u₁, v₁), . . . , (u₄, v₄) to exactly reconstruct the update matrices ΔW_q = u_q v_q^T, and updates its local copy of the parameter matrix: W_p ← W_p − Σ_{q=1}^4 u_q v_q^T. While the above description reflects synchronous execution, it is easy to extend to (bounded) asynchronous execution.
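A toy sketch of one synchronous SFB clock follows; the names and the learning rate are illustrative assumptions. Because every worker applies all P reconstructed updates, all replicas remain identical under BSP.

    import numpy as np

    def sfb_clock(replicas, sfgs, lr=1.0):
        # replicas: list of P parameter matrices (one per worker)
        # sfgs: list of P (u, v) pairs, one broadcast by each worker
        for W in replicas:                # each worker...
            for u, v in sfgs:             # ...receives all P groups and
                W -= lr * np.outer(u, v)  # reconstructs Delta W = u v^T
        return replicas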

Since an update to the parameter matrix, referred to as an update matrix (UM), can be computed from a few sufficient factors, sending an update matrix from machine A to machine B can be equivalently done by first transferring the sufficient factors from A to B, and then producing the update matrix from the sufficient factors received at B.

The communication cost of transmitting sufficient factors is O(J+K), which is linear in the matrix dimensions, while that of transmitting update matrices is O(JK), which is quadratic in the matrix dimensions. Hence sufficient factor transfer, instead of update matrix transfer, can greatly reduce communication overhead. The transformation from sufficient factors to an update matrix is mathematically exact, without compromising computational correctness.

In a parameter server, the one-sided communication cost from the worker machines to the server can be reduced by transmitting sufficient factors. In this case, each worker machine sends new sufficient factor groups to the server, where the received sufficient factor groups are transformed into update matrices to update the shared state of the parameter matrix. However, since the parameter matrix cannot be computed from a few sufficient factors, the newly-updated parameters need to be sent from the server to the worker machines as a matrix, which still incurs high communication overhead. To avoid transmitting parameter matrices, the large-scale distributed machine learning system disclosed herein adopts a decentralized peer-to-peer architecture, where worker machines synchronize their parameter replicas by exchanging updates in the form of sufficient factors. In each clock, each worker machine computes sufficient factor groups and broadcasts them to the other worker machines. Meanwhile, each worker machine converts the remotely-received sufficient factor groups into update matrices, which are subsequently added to the parameter replica resident in that worker machine. This computation model is referred to as sufficient factor broadcasting. Unlike the parameter server, the decentralized peer-to-peer architecture of the large-scale distributed machine learning system disclosed herein does not maintain a shared state of the parameter matrix and can avoid transmitting matrices.

While the transfer of sufficient factor groups among peer-to-peer machines greatly reduces communication cost, such transfer increases computation overhead because each sufficient factor group is converted into the same update at each of the peer-to-peer machines; thus, the sufficient factor group is converted multiple times. However, in-memory computation is usually much more efficient than inter-machine network communication, especially with the advent of graphics processing unit (GPU) computing, hence the reduction in communication cost overshadows the increase in computation overhead.

Random Multicast

While the peer-to-peer transfer of sufficient factor groups greatly reduces the size of each message from a matrix to a few vectors, a limitation of such transfer is that a large number of sufficient factor groups needs to be sent from each machine in the system to every other machine in the system, which renders the number of messages per clock quadratic in the number of machines P. To address this issue, the large-scale distributed machine learning system disclosed herein adopts random multicast. During random multicast, in each clock cycle, each machine randomly selects Q (Q < P−1) machines to send one or more sufficient factor groups to. This reduces the number of messages sent per clock cycle from O(P²) to O(PQ).

FIG. 3 shows an example of random multicast. In each iteration t, an update U_p^t generated by machine p is sent only to machines that are directly connected with p (and the update U_p^t takes effect at iteration t+1). The effect of U_p^t is indirectly and eventually transmitted to every other machine q, via the updates generated by machines sitting between p and q in the topology. This happens at iteration t+τ, for some delay τ > 1 that depends on Q and the locations of p and q in the network topology. Consequently, the P machines will not have the exact same parameter image W, even under bulk synchronous parallel execution; yet this does not empirically compromise algorithm accuracy as long as Q is not too small.

Two random selection methods are provided. One is uniform selection: each machine has the same probability of being selected. The other is prioritized selection, for load-balancing purposes. Each machine is assigned a priority score based on its progress (measured by clock). A machine with faster progress (higher priority) is selected with higher probability and receives more sufficient factors from slower machines. It spends more compute cycles consuming these remote sufficient factors, which slows down its computation of new sufficient factors, giving the slower machines a grace period to catch up.
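The two selection schemes might be sketched as follows; this is an illustrative assumption, as the exact priority weighting used by the system is not specified here.

    import random

    def select_targets(peers, clocks, Q, prioritized=False):
        # Pick Q of the other machines to receive this clock's SFGs.
        if not prioritized:
            return random.sample(peers, Q)   # uniform selection
        # Prioritized selection: weight each peer by its progress so that
        # faster machines (higher clock) are chosen more often.
        candidates = list(peers)
        weights = [clocks[p] + 1 for p in candidates]
        chosen = []
        for _ in range(Q):   # weighted sampling without replacement
            i = random.choices(range(len(candidates)), weights=weights)[0]
            chosen.append(candidates.pop(i))
            weights.pop(i)
        return chosen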

Unlike a deterministic multicast topology, where each machine communicates with a fixed set of machines throughout the application run, random multicast provides several benefits. First, dynamically changing the topology in each clock gives every two machines a chance to communicate directly, which facilitates more symmetric synchronization. Second, random multicast is more robust to network connection failures, since the failure of a network connection between two machines will not affect their communication with other machines. Third, random multicast makes resource elasticity simpler to implement because adding and removing machines requires minimal coordination with existing ones, unlike a deterministic topology, which must be modified every time a worker joins or leaves.

Sufficient Factor Selection

In machine learning practice, parameter updates are usually computed over a small batch (whose size typically ranges from tens to hundreds) of examples. At each clock, a batch of K training examples is selected and a parameter update is generated with respect to each example. When represented as matrices, these K updates can be aggregated into a single matrix to communicate; hence the communication cost is independent of K. However, this is not the case in sufficient factor transfer. In sufficient factor transfer, a batch of K sufficient factor groups cannot be aggregated into one single sufficient factor group; instead, they are transferred individually. Therefore, the communication cost grows linearly with K.

To alleviate this cost, the large-scale distributed machine learning system disclosed herein provides for sufficient factor selection (SFS). During sufficient factor selection, a machine chooses a subset of C sufficient factor groups from a set or batch of K sufficient factor groups (where C < K) to send to other machines. The chosen subset of C sufficient factor groups corresponds to the sufficient factor groups that best represent the entire batch.

An efficient sampling-based algorithm called joint matrix column subset selection (JMCSS) performs sufficient factor selection. Given the P matrices X^(1), . . . , X^(P), where X^(p) stores the p-th sufficient factor of all sufficient factor groups, JMCSS selects a subset of non-redundant column vectors from each matrix to approximate the entire matrix. The selections of columns in different matrices are tied together, i.e., if the i-th column is selected in one matrix, then for all other matrices their i-th columns must be selected as well, to atomically form a sufficient factor group. Let I = {i₁, . . . , i_C} index the selected sufficient factor groups and S_I^(p) be a matrix whose columns are from X^(p) and indexed by I. The goal is to find the optimal selection I such that the following approximation error is minimized: Σ_{p=1}^{P} ∥X^(p) − S_I^(p) (S_I^(p))^† X^(p)∥₂, where (S_I^(p))^† is the pseudo-inverse of S_I^(p).

Finding the exact solution of this problem is NP-hard. To address this issue, a sampling-based method (Algorithm 3), which is an adaptation of the iterative norm sampling algorithm, is used. This algorithm is shown in FIG. 4. Let S^(p) be a dynamically growing matrix that stores the column vectors selected from X^(p), and let S_t^(p) denote the state of S^(p) at iteration t. Accordingly, X^(p) is dynamically shrinking, and its state is denoted by X_t^(p). At the t-th iteration, an index i_t is sampled and the i_t-th column vectors are taken out of {X^(p)}_{p=1}^P and added to {S^(p)}_{p=1}^P. The index i_t is sampled in the following way. First, the squared L2 norm of each column vector in {X_{t−1}^(p)}_{p=1}^P is computed. Then i_t (1 ≤ i_t ≤ K+1−t) is sampled with probability proportional to Π_{p=1}^{P} ∥x_{i_t}^(p)∥₂², where x_{i_t}^(p) denotes the i_t-th column of X_{t−1}^(p).

Then a back projection is utilized to transform X_t^(p): X_t^(p) ← X_t^(p) − S_t^(p) (S_t^(p))^† X_t^(p). After C iterations, the selected sufficient factors contained in {S^(p)}_{p=1}^P are obtained and packed into sufficient factor groups, which are subsequently sent to other machines. Under JMCSS, the aggregated update generated from the C sufficient factor groups is close to that computed from the entire batch. Hence sufficient factor selection does not compromise parameter-synchronization quality.
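For illustration only, a compact numpy sketch of this iterative norm-sampling selection (in the spirit of Algorithm 3) is given below; variable names and the use of a pseudo-inverse per iteration are assumptions.

    import numpy as np

    def jmcss(X, C):
        # X: list of P matrices, each of shape (dim_p, K); the i-th columns
        # across all P matrices form the i-th sufficient factor group.
        P, K = len(X), X[0].shape[1]
        remaining = list(range(K))
        X_t = [x.astype(float).copy() for x in X]
        selected = []
        for _ in range(C):
            # Sample an index with probability proportional to the product
            # of squared column norms across the P residual matrices.
            probs = np.ones(len(remaining))
            for p in range(P):
                probs *= (X_t[p] ** 2).sum(axis=0)
            probs /= probs.sum()
            j = np.random.choice(len(remaining), p=probs)
            selected.append(remaining.pop(j))
            for p in range(P):
                X_t[p] = np.delete(X_t[p], j, axis=1)
            # Back projection: remove from the remaining columns the
            # component already explained by the selected columns.
            for p in range(P):
                S = X[p][:, selected]
                X_t[p] -= S @ np.linalg.pinv(S) @ X_t[p]
        return selected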

The selection of sufficient factors is pipelined with their computation and communication to increase throughput. Two FIFO queues (denoted A and B) containing sufficient factors are utilized for coordination. The computation thread adds newly-computed sufficient factors to queue A. The selection thread dequeues sufficient factors from A, executes the selection, and adds the selected sufficient factors to queue B. The communication thread dequeues sufficient factors from B and sends them to other machines. The three modules operate asynchronously: for each one, as long as its input queue is not empty and its output queue is not full, the operation continues. The two queues can be concurrently accessed by their producers and consumers.
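A simplified sketch of this three-stage pipeline, using Python threads and bounded queues, follows; the worker functions are hypothetical placeholders standing in for the modules described above.

    import queue, threading, time

    def compute_next_sfg():          # placeholder: compute one SFG locally
        time.sleep(0.01)
        return ("u", "v")

    def select_subset(sfgs, C):      # placeholder: e.g., a JMCSS selection
        return sfgs[:C]

    def send_to_peers(sfg):          # placeholder: network transfer
        pass

    A = queue.Queue(maxsize=64)      # newly computed sufficient factors
    B = queue.Queue(maxsize=64)      # selected sufficient factors

    def computation_thread():
        while True:
            A.put(compute_next_sfg())            # blocks when A is full

    def selection_thread(batch=32, C=8):
        while True:
            batch_sfgs = [A.get() for _ in range(batch)]
            for sfg in select_subset(batch_sfgs, C):
                B.put(sfg)

    def communication_thread():
        while True:
            send_to_peers(B.get())               # blocks when B is empty

    # In a real worker these threads run for the lifetime of the process.
    for fn in (computation_thread, selection_thread, communication_thread):
        threading.Thread(target=fn, daemon=True).start()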

Sufficient Factor Representation of Parameters

First, a sufficient-factor representation (SFR) of the parameters is presented. At clock T, the parameter state W_T is mathematically equal to W₀ + Σ_{t=1}^{T} ΔW_t, where ΔW_t is the update matrix computed at clock t and W₀ is the initialization of the parameters. As noted earlier, ΔW_t can be computed from a sufficient factor group G_t: ΔW_t = h(G_t), using a transformation h. To initialize the parameters, a sufficient factor group G₀ is randomly generated, and W₀ = h(G₀). To this end, the parameter state can be represented as W_T = Σ_{t=0}^{T} h(G_t), using a set of sufficient factor groups. The sufficient-factor representation can be leveraged to reduce computation cost. First of all, since no parameter matrix needs to be maintained, there is no need to explicitly compute the update matrix in each clock, which otherwise incurs O(JK) cost.

Second, in most matrix-parameterized models, a major computation workload is to multiply the parameter matrix by a vector, whose cost is quadratic in the matrix dimensions. This cost is reduced by executing the multiplication in a sufficient-factor-aware way. The details are given in the following subsection.

Incremental Sufficient Factor Checkpoint

Based on the SF-representation (SFR) of parameters and inspired by asynchronous and incremental checkpointing methods, the large-scale distributed machine learning system disclosed herein provides an incremental sufficient factor checkpoint (ISFC) mechanism for fault tolerance and recovery: each machine continuously saves the new sufficient factor groups computed in each clock to stable storage and restores the parameters from the saved sufficient factor groups when a machine failure happens. Unlike existing systems, which checkpoint large matrices, saving small vectors consumes much less disk bandwidth. To reduce the frequency of disk writes, the sufficient factor groups generated after each clock are not immediately written to disk, but staged in host memory. When a large batch of sufficient factor groups has accumulated, the large-scale distributed machine learning system disclosed herein writes them together.

Incremental sufficient factor checkpointing does not require the application program to halt while checkpointing the sufficient factors. The IO thread reads the sufficient factors and the computing thread writes the parameter matrix, so there is no read/write conflict. In contrast, in matrix-based checkpointing, the IO thread reads the parameter matrix, which requires the computation thread to halt to ensure consistency, incurring a waste of compute cycles.

The incremental sufficient factor checkpoint is able to roll back the parameters to the state at any clock. To obtain the state at clock T, the large-scale distributed machine learning system disclosed herein collects the sufficient factor groups computed up to T and transforms them into a parameter matrix. This granularity is much finer than that of checkpointing parameter matrices: since saving large-sized matrices to disk is time-consuming, a matrix-checkpointing system can only afford to perform a checkpoint periodically, and the parameter states between two checkpoints are lost. The restore(T) API is used for recovery, where T is a user-specified clock to which the parameters are to be rolled back. The default T is the latest clock.
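A minimal sketch of this checkpoint/restore behavior is given below, assuming an append-only pickle log and outer-product updates; the file format, names, and batching threshold are illustrative assumptions.

    import pickle
    import numpy as np

    staged = []   # sufficient factor groups staged in host memory

    def checkpoint(sfg, clock, path="sf_log.pkl", flush_every=1000):
        staged.append((clock, sfg))
        if len(staged) >= flush_every:     # batch disk writes
            with open(path, "ab") as f:
                for item in staged:
                    pickle.dump(item, f)
            staged.clear()

    def restore(T, shape, path="sf_log.pkl"):
        # Rebuild the parameter state from all SFGs saved up to clock T.
        W = np.zeros(shape)
        with open(path, "rb") as f:
            while True:
                try:
                    clock, (u, v) = pickle.load(f)
                except EOFError:
                    break
                if clock <= T:
                    W += np.outer(u, v)    # h(G_t) for outer-product updates
        return W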

Sufficient-Factor-Aware Multiplication and Tree Rewriting

In multiclass logistic regression (MLR), each sufficient factor group contains two sufficient factors u, v, whose outer product uv^T produces a parameter update. Consequently, the sufficient-factor representation of the parameter state W_T is Σ_{t=0}^{T} u_t v_t^T. The multiplication between W_T and a vector x can be computed in the following way: W_T x = (Σ_{t=0}^{T} u_t v_t^T)x = Σ_{t=0}^{T} u_t (v_t^T x), which first calculates the inner product v_t^T x between v_t and x, then multiplies the inner product with u_t. The computation cost is O(T(J+K)), which is linear in the matrix dimensions and grows with T. As another example, suppose each sufficient factor group contains two sufficient factors and the parameter update is computed as ΔW = uu^T − vv^T. Then W_T is represented as Σ_{t=0}^{T}(u_t u_t^T − v_t v_t^T), and W_T x can be computed as Σ_{t=0}^{T}(u_t (u_t^T x) − v_t (v_t^T x)), whose cost is O(T(J+K)) as well. When T is small, sufficient-factor-aware multiplication is highly efficient.
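For illustration, the first case might be sketched as follows (names are assumptions); the point is that W_T x is computed without ever materializing W_T:

    import numpy as np

    def sf_aware_multiply(us, vs, x):
        # us[t], vs[t] are the sufficient factors of clock t, so that
        # W_T = sum_t u_t v_t^T. Cost is O(T(J+K)) rather than O(JK).
        result = np.zeros_like(us[0], dtype=float)
        for u, v in zip(us, vs):
            result += u * (v @ x)   # inner product first, then scale u
        return result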

The large-scale distributed machine learning system disclosed herein may use a multiplication tree to perform sufficient-factor-aware multiplication. A multiplication tree is rewritten from an updating tree built by parsing the compute_update function, which is either defined by users or automatically identified by the system. At the leaf nodes of the updating tree are sufficient factors, and at the internal nodes are operations. An in-order traversal of the updating tree transforms the sufficient factors into an update matrix: at each internal node, the associated operation is applied to the data objects (either sufficient factors or matrices) at its two children. The update matrix is obtained at the root.

Given this updating tree, it is rewritten into a multiplication tree. For each subtree in the updating tree, if the operation at the root is the vector outer product and the children of the root are two sufficient factors sv0 and sv1, then the subtree is transformed into a new tree with three layers: at the root is scalar-vector multiplication "*"; the two children of the root are sv0 and a vector inner-product node; and the two children of the inner-product node are sv1 and x (the vector involved in W_T x). The "+" and "−" operations representing matrix addition/subtraction in the updating tree are replaced with vector addition/subtraction in the multiplication tree.

To compute W_T x, where W_T is represented with T+1 sufficient factor groups, the sufficient factors of each sufficient factor group and x are fed into the leaf nodes of the multiplication tree, and then an in-order traversal is performed to obtain a vector at the root. W_T x is obtained by adding up the vectors generated from all the sufficient factor groups.

Programming Model

The programming model of the large-scale distributed machine learning system disclosed herein provides a data abstraction called the sufficient factor group and two user-defined functions that generate and consume sufficient factor groups to update model parameters. Each sufficient factor group contains a set of sufficient factors that are generated with respect to one data example and atomically produces a parameter update. The sufficient factors are immutable and dense, and their default type is float. Inside a sufficient factor group, each sufficient factor has an index.

To program an application for execution by the large-scale distributed machine learning system disclosed herein, users specify two functions: (1) compute_svg, which takes the current parameter state and one data example as inputs and computes vectors that collectively form a sufficient factor group; and (2) compute_update, which takes a sufficient factor group and produces a parameter update. These two functions are invoked by the disclosed engine to perform data-parallel distributed machine learning: each of the P machines holds one shard of the training data and a replica of the parameters; different parameter replicas are synchronized across machines to retain consistency (consistency meaning that different replicas are encouraged to be as close as possible).

Every machine executes a sequence of operations iteratively: in each clock, a small batch of training examples is randomly selected from the data shard and compute_svg is invoked to compute a sufficient factor group with respect to each example; the sufficient factor groups are then sent to other machines for parameter synchronization; and compute_update is invoked to transform locally-generated sufficient factor groups and remotely-received sufficient factor groups into updates, which are subsequently added to the parameter replica. The execution semantics (per clock) of the disclosed engine is shown in FIG. 5 as Algorithm 4. Unlike existing systems, which directly compute parameter updates from training data, the large-scale distributed machine learning system disclosed herein breaks this computation into two steps and explicitly exposes the intermediate sufficient factors to users, which enables SF-based system-level optimizations to be exploited.

Below is shown how these two functions are implemented in multiclass logistic regression. The inputs of the compute_svg function include the parameter replica Parameters and a data example Data, and the output is an SFG. A sufficient factor group is declared via SFG([d₁, . . . , d_J]), where d_j is the length of the j-th SF. In multiclass logistic regression, a sufficient factor group contains two sufficient factors: the first one is the difference between the prediction vector softmax(W * smp.feats) and the label vector smp.label; the second one is the feature vector smp.feats. The update matrix is computed as the outer product between the two sufficient factors.

    def compute_svg(Parameters W, Data smp):
        svg = SFG([W.nrows, W.ncols])
        x = softmax(W * smp.feats) - smp.label
        svg.sv[0] = x
        svg.sv[1] = smp.feats
        return svg

    def compute_update(SFG svg):
        return outproduct(svg.sv[0], svg.sv[1])

Automatic Identification of Sufficient Factors and Updates

When machine learning models are trained using gradient descent or quasi-Newton algorithms, the computation of sufficient factor groups and updates can be automatically identified by the disclosed engine, which relieves users from writing the two functions compute_svg and compute_update. The only input required from users is a symbolic expression of the loss function, which is in general much easier to program compared with the two functions. Note that this is not an extra burden: in most machine learning applications, users need to specify this loss function to measure the progress of execution.

The identification procedure for sufficient factors depends on the optimization algorithm, either gradient descent or quasi-Newton, specified by the users for minimizing the loss function. For both algorithms, automatic differentiation techniques are needed to compute the gradients of variables. Given the symbolic expression of the loss function, such as f = cross_entropy(softmax(W*x), y) in multiclass logistic regression, the disclosed engine first parses the expression into an expression graph as shown in FIG. 6. In the figure, circles denote variables, including terminals such as W, x, y and intermediate ones such as a = W*x and b = softmax(a); boxes denote operators applied to variables. According to their inputs and outputs, operators can be categorized into different types, shown in the table included in FIG. 6. Given the expression graph, the large-scale distributed machine learning system disclosed herein uses automatic differentiation to compute the symbolic expression of the gradient ∂f/∂z of f with respect to each unknown variable z (either a terminal or an intermediate one). The computation is executed recursively in the backward direction of the graph. For example, in FIG. 6, to obtain ∂f/∂a, ∂f/∂b is first computed, then it is transformed into ∂f/∂a using an operator-specific matrix A. For a type-2 operator (e.g., softmax) in the table in FIG. 6, A_ij = ∂b_j/∂a_i.

If W is involved in a type-5 operator (see the table in FIG. 6), which takes W and a vector x as inputs and produces a vector a, and the gradient descent algorithm is used to minimize the loss function, then the sufficient factor group contains two sufficient factors which can be automatically identified: one is ∂f/∂a and the other is x. Accordingly, the update of W can be automatically identified as the outer product of the two sufficient factors.

If quasi-Newton methods are used to learn machine learning models parameterized by a vector x, the large-scale distributed machine learning system disclosed herein can automatically identify the sufficient factors of the update of the approximated Hessian matrix W. First of all, automatic differentiation is applied to compute the symbolic expression of the gradient g(x) = ∂f/∂x. To identify the sufficient factors at clock k, the states x_{k+1} and x_k of the parameter vector at clocks k+1 and k are plugged into g(x) to calculate a vector y_k = g(x_{k+1}) − g(x_k). Another vector s_k = x_{k+1} − x_k is computed. Then, based on s_k, y_k, and W_k (the state of W at clock k), the sufficient factors, which depend on the specific quasi-Newton algorithm instance, can be identified. For BFGS, the procedures are: (1) set y_k ← y_k/√(y_k^T s_k); (2) compute v_k = W_k s_k; (3) set v_k ← v_k/√(s_k^T v_k). Then the sufficient factors are identified as y_k and v_k, and the update of W_k is computed as y_k y_k^T − v_k v_k^T. For DFP, the procedures are: (1) set s_k ← s_k/√(y_k^T s_k); (2) compute v_k = W_k y_k; (3) set v_k ← v_k/√(y_k^T v_k). Then the sufficient factors are identified as s_k and v_k, and the update of W_k is computed as s_k s_k^T − v_k v_k^T.
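As an illustrative sketch (assuming numpy and the notation above), the BFGS case reduces to a few vector operations:

    import numpy as np

    def bfgs_sfg(W_k, s_k, y_k):
        # Derive the two sufficient factors whose outer products form the
        # update Delta W_k = y y^T - v v^T of the approximated Hessian.
        y = y_k / np.sqrt(y_k @ s_k)   # step (1)
        v = W_k @ s_k                  # step (2)
        v = v / np.sqrt(s_k @ v)       # step (3)
        return y, v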

Implementation

With reference to FIG. 7, the large-scale distributed machine learning system disclosed herein is a decentralized system, where workers are symmetric, each running the same software stack, which is conceptually divided into three layers: (1) a machine learning application layer including machine learning programs implemented on top of the disclosed engine, such as multiclass logistic regression, topic models, deep learning models, etc.; (2) a service layer for automatic identification of sufficient factors, sufficient factor selection, fault tolerance, etc.; and (3) a peer-to-peer communication layer for sufficient factor transfer and random multicast.

The major modules in the disclosed engine include: (1) an interpreter that automatically identifies the symbolic expressions of sufficient factors and parameter updates; (2) a sufficient factor generator that selects training examples from the local data shard and computes a sufficient factor group for each example using the symbolic expressions of sufficient factors produced by the interpreter; (3) a sufficient factor selector that chooses a small subset of the most representative sufficient factors out of those computed by the generator for communication; (4) a communication manager that transfers the sufficient factors chosen by the selector using broadcast or random multicast and receives remote sufficient factors; (5) an update generator which computes update matrices from locally-generated and remotely-received sufficient factors and updates the parameter matrix; and (6) a central coordinator for periodic centralized synchronization, parameter-replica rotation, and elasticity.

Heterogeneous computing: The programming interface of the disclosed system exposes a rich set of operators, such as matrix multiplication, vector addition, and softmax, through which users write their machine learning programs. To support heterogeneous computing, each operator has a CPU implementation and a GPU implementation built upon highly optimized libraries. In the GPU implementation, the disclosed engine performs kernel fusion, which combines a sequence of kernels into a single one to reduce the number of kernel launches, which bear large overhead. The disclosed engine generates a dependency graph of operators by parsing the user's program and traverses the graph to fuse consecutive operators into one CUDA kernel.
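
A highly simplified sketch of such a fusion pass over a linear operator sequence, using a hypothetical op representation in place of the engine's dependency graph (the real engine emits fused CUDA kernels; plain Python callables are used here only to illustrate the traversal):

    # Fuse runs of consecutive elementwise operators into one composite "kernel".
    ELEMENTWISE = {"add_bias", "relu", "scale"}

    def make_fused(fns):
        def fused(x):
            for f in fns:                    # the whole chain runs in one pass
                x = f(x)
            return x
        return fused

    def fuse(ops):
        """ops: list of (name, fn) pairs in dependency order."""
        out, run = [], []
        def flush():
            if run:
                out.append(("fused:" + "+".join(n for n, _ in run),
                            make_fused([f for _, f in run])))
                run.clear()
        for name, fn in ops:
            if name in ELEMENTWISE:
                run.append((name, fn))       # extend the current elementwise run
            else:
                flush()                      # non-elementwise op breaks the run
                out.append((name, fn))
        flush()
        return out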

Elasticity: The large-scale distributed machine learning system disclosed herein is elastic to resource adjustment. Adding new machines and preempting existing machines do not interrupt the current execution. To add a new machine, the central coordinator executes the following steps: (1) launching the disclosed engine and application program on the new machine; (2) averaging the parameter replicas of the existing machines and placing the averaged parameters on the new machine; (3) taking a chunk of training data from each existing machine and assigning the data to the new machine; (4) adding the new machine into the peer-to-peer network. When an existing machine is preempted, it is taken off the peer-to-peer network and its data shard is redistributed to the other machines.
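
The coordinator-side procedure might look as follows; every coordinator and machine method name here is hypothetical, chosen only to mirror the four steps above, and is not an actual API of the disclosed system:

    # Sketch of the four coordinator steps for adding a new machine.
    def add_machine(coordinator, new_machine):
        coordinator.launch_engine_and_program(new_machine)      # step (1)
        averaged = coordinator.average_parameter_replicas()     # step (2)
        new_machine.set_parameters(averaged)
        for machine in coordinator.machines:                    # step (3)
            new_machine.assign_data(machine.take_data_chunk())
        coordinator.p2p_network.add(new_machine)                # step (4)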

Periodic centralized synchronization: Complementary to the peer-to-peer decentralized parameter synchronization, the large-scale distributed machine learning system disclosed herein performs a centralized synchronization periodically. The central coordinator sets a global barrier every R clocks. When all workers reach this barrier, the coordinator calls the AllReduce(average) interface to average the parameter replicas and set each replica to the average. After that, workers perform decentralized synchronization until the next barrier. Centralized synchronization effectively removes the discrepancy among parameter replicas accumulated during decentralized execution, and it does not incur substantial communication cost since it is invoked only periodically.
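
A minimal sketch of one such synchronization point, assuming an MPI-style deployment via mpi4py (which may differ from the disclosed system's actual transport):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    # Average the parameter replicas across all workers at the global barrier.
    def synchronize_replicas(W_local):
        comm.Barrier()                                  # barrier reached every R clocks
        W_avg = np.empty_like(W_local)
        comm.Allreduce(W_local, W_avg, op=MPI.SUM)      # sum the replicas
        W_avg /= comm.Get_size()                        # divide by the worker count
        return W_avg                                    # each replica set to the average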

Rotation of parameter replicas: The large-scale distributed machine learning system disclosed herein adopts data parallelism, where each worker has access to one shard of the data. Since computation is usually much faster than communication, the updates computed locally are much more frequent than those received remotely. This renders imbalanced updating of parameters: a parameter replica is updated more frequently based on the local data residing on the same machine than on the data shards on other machines. This is another cause of out-of-synchronization. To address this issue, the large-scale distributed machine learning system disclosed herein performs parameter-replica rotation, which enables each parameter replica to explore all data shards on different machines. Logically, the machines are connected via a ring network. Parameter replicas rotate along the ring periodically (every S iterations) while each data shard stays on the same machine during the entire execution. The parameters are rotated rather than the data because the size of the parameters is much smaller than that of the data. A centralized coordinator sets a barrier every S iterations. When all workers reach the barrier, it invokes the Rotate API, which triggers the rotation of parameter replicas.
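
A minimal sketch of one rotation step along the logical ring, again assuming an mpi4py deployment:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Send the local replica to the next machine on the ring and receive the
    # replica from the previous machine; data shards stay where they are.
    def rotate_replica(W_local):
        W_next = np.empty_like(W_local)
        comm.Sendrecv(W_local, dest=(rank + 1) % size,
                      recvbuf=W_next, source=(rank - 1) % size)
        return W_next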

Data prefetching: The loading of training data from CPU to GPU is overlapped with the sufficient factor generator via a data queue. The next batches of training examples are prefetched into the queue while the generator is processing the current one. In certain applications, each training example is associated with a data-dependent variable (DDV). For instance, in a topic model, each document has a topic proportion vector. The states of DDVs need to be maintained throughout execution. Training examples and their DDVs are stored in consecutive host/device memory for locality and are prefetched together. At the end of a clock, the GPU buffer storing examples is immediately ready for overwriting. The DDVs are swapped from GPU memory to host memory, which is pipelined using a DDV queue.
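
A minimal sketch of the queue-based overlap, with to_device() standing in as a hypothetical placeholder for the host-to-device copy:

    import queue
    import threading

    def to_device(batch):
        return batch                             # placeholder for the H2D copy

    def prefetcher(batches, data_queue):
        for batch in batches:
            data_queue.put(to_device(batch))     # blocks while the queue is full
        data_queue.put(None)                     # sentinel: end of data

    # Overlap loading of the next batches with processing of the current one.
    def run(batches, process_batch):
        data_queue = queue.Queue(maxsize=4)      # bounded prefetch depth
        t = threading.Thread(target=prefetcher, args=(batches, data_queue))
        t.start()
        while (batch := data_queue.get()) is not None:
            process_batch(batch)                 # generator consumes the batch
        t.join()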

Hardware/Software-Aware Sufficient Factor Transfer

The large-scale distributed machine learning system disclosed herein provides a communication library for efficient message broadcasting. It contains a collection of broadcast methods designed for different hardware and software configurations, including (1) whether the communication is CPU-to-CPU or GPU-to-GPU; (2) whether InfiniBand is available; and (3) whether the consistency model is bulk synchronous parallel or staleness synchronous parallel.

CPU-to-CPU, bulk synchronous parallel: In this case, the MPI_Allgather routine is used to perform all-to-all broadcast. In each clock, it gathers the sufficient factors computed by each machine and distributes them to all machines. MPI_Allgather is a blocking operation (i.e., control does not return to the application until the receiving buffer holds the sufficient factors from all machines). This is in accordance with the bulk synchronous parallel consistency model, where the execution cannot proceed to the next clock until all machines reach the global barrier.
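
A minimal sketch of this exchange, using mpi4py's pickle-based allgather for brevity in place of the buffer-based MPI_Allgather:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Blocking all-to-all broadcast: gathers every machine's sufficient factor
    # groups and returns the combined set; control returns only when the
    # contributions from all machines have arrived.
    def exchange_sfgs(local_sfgs):
        all_sfgs = comm.allgather(local_sfgs)
        return [sfg for machine_sfgs in all_sfgs for sfg in machine_sfgs]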

CPU-to-CPU, staleness synchronous parallel: Under staleness synchronous parallel, each machine is allowed to compute and broadcast sufficient factors at a different pace. To enable this, the all-to-all broadcast is decomposed into multiple one-to-all broadcasts. Each machine separately invokes the MPI_Bcast routine to broadcast its messages to the others. MPI_Bcast is a blocking operation: the next message cannot be sent until the current one finishes. This guarantees that the sufficient factors are received in order: sufficient factors generated at clock t arrive earlier than those generated at clock t+1. This order is important for the correctness of machine learning applications: the updates generated earlier should be applied first.

CPU-to-CPU, bulk synchronous parallel, InfiniBand: An all-gather operation is executed by leveraging the Remote Direct Memory Access (RDMA) feature provided by InfiniBand, which supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, without going through the operating system. The recursive doubling (RD) algorithm is used to implement all-gather, where pairs of processes exchange their sufficient factors via point-to-point communication. In each iteration, the sufficient factors collected during all previous iterations are included in the exchange, so the data held by each process doubles per iteration. RDMA is used for the point-to-point transfers during the execution of recursive doubling.
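
A sketch of the recursive doubling pattern, using ordinary MPI point-to-point calls in place of RDMA and assuming the number of processes is a power of two:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # In iteration i, exchange everything collected so far with the partner
    # whose rank differs in bit i; the collected set doubles each iteration.
    def recursive_doubling_allgather(local_sfgs):
        collected = list(local_sfgs)
        step = 1
        while step < size:
            partner = rank ^ step            # pairwise exchange partner
            remote = comm.sendrecv(collected, dest=partner, source=partner)
            collected.extend(remote)
            step *= 2
        return collected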

CPU-to-CPU, staleness synchronous parallel, InfiniBand: Each machine performs one-to-all broadcast separately, using the hardware supported broadcast (HSB) in InfiniBand. HSB is topology-aware: packets are duplicated by the switches only when necessary, so network traffic is reduced by avoiding cases in which multiple identical packets travel through the same physical link. The limitation of hardware supported broadcast is that messages can be dropped or arrive out of order, which degrades the correctness of machine learning execution. To retain reliability and in-order delivery, another layer of network protocol is added on top of hardware supported broadcast, in which (1) receivers send ACKs back to the root machine to confirm message delivery; (2) a message is re-transmitted using point-to-point reliable communication if no ACK is received before a timeout; and (3) receivers use a continuous clock counter to detect out-of-order messages and put them in order.

GPU-to-GPU: To reduce the latency of inter-machine sufficient factor transfer between two GPUs, the GPUDirect RDMA provided by CUDA is used, which allows network adapters to directly read from or write to GPU device memory, without staging through host memory. Between two network adapters, the sufficient factors are communicated using the methods listed above.

Similar to broadcast, several multicast methods tailored to different system configurations are provided.

CPU-to-CPU: MPI group communication primitives are used for CPU-to-CPU multicast. In each clock, MPI_Comm_split is invoked to split the communicator MPI_COMM_WORLD into a target group (containing the selected machines) and a non-target group. Then the message is broadcast within the target group.
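
A minimal sketch of this split-then-broadcast multicast with mpi4py; all ranks are assumed to call the function collectively with the same targets and root:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Split MPI_COMM_WORLD into the target group (selected receivers plus the
    # sender) and a non-target group, then broadcast inside the target group.
    def multicast(sfgs, targets, root):
        members = sorted(set(targets) | {root})
        color = 0 if rank in members else 1
        sub = comm.Split(color=color, key=rank)    # MPI_Comm_split
        msg = None
        if rank in members:
            sub_root = members.index(root)         # root's rank within the group
            payload = sfgs if rank == root else None
            msg = sub.bcast(payload, root=sub_root)
        sub.Free()
        return msg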

CPU-to-CPU, InfiniBand: The efficient but unreliable multicast supported by InfiniBand at the hardware level is used together with a reliable point-to-point network protocol. InfiniBand combines the selected machines into a single multicast address and sends the message to it. A point-to-point re-transmission is issued if no ACK is received before a timeout. Since the selection of receivers is random, no machine receives messages from another machine in consecutive clocks, making it difficult to detect out-of-order messages. A simple approach is therefore adopted: a message is discarded if it arrives late.

GPU-to-GPU: GPUDirect remote direct memory access is used to copy buffers from GPU memory to the network adapter. The communication between network adapters is then handled using the two methods given above.

Medical Topic Discovery Based on Large-Scale Distributed Learning with Sufficient Factors

As generally described above with reference to FIG. 1, medical topic discovery on a large corpus of clinical documents 112 may be efficiently performed by dividing the initial task of document processing among the processors 102, 104, 106 in the system and sharing the results of the document processing, e.g., the sufficient factor groups, across the system. The sharing of results enables the creation of a topic matrix through an iterative approach, in which an initial topic matrix derived from an initial set of documents is updated as additional sets of documents are processed by the system 100.

Creation of the topic matrix by the medical topic discovery system 100 involves initial document processing steps that provide a measure of association between words and medical topics in documents. In one approach to document processing, referred to as topic models, clinical documents are represented by unordered sets of words. A “topic” is thus a set of words that tend to co-occur, and represents word co-occurrence patterns that are shared across multiple documents. A “word” in a document may correspond to a medical condition, a symptom, patient demographics, etc. See, e.g., Devendra S. Sachan, Pengtao Xie, and Eric P. Xing, “Effective use of bidirectional language modeling for medical named entity recognition,” arXiv preprint arXiv:1711.07908, 2017.

The system 100 of FIG. 1 may be implemented by a large-scale peer-to-peer distributed machine learning system including a plurality of processors configured in accordance with the models and features described above with reference to FIGS. 2-8. In the system 100 of FIG. 1, topics addressed in medical documents may be learned as follows. Each medical document is represented with a bag-of-words vector d ∈ R^(V), where V is the vocabulary size. Each document is assumed to be an approximate linear combination of K topics: d ≈ Wθ. W ∈ R^(V×K) is the topic matrix, where each topic is represented with a sufficient factor in the form of a vector w ∈ R^(V). w_(v) ≥ 0 denotes the association strength between the v-th word and this topic, and Σ_(v=1)^(V) w_(v)=1. θ ∈ R^(K) contains the linear combination weights, satisfying θ_(k) ≥ 0 and Σ_(k=1)^(K) θ_(k)=1, where θ_(k) denotes how relevant topic k is to the document.
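
A small numeric illustration of this representation (toy values, with V = 5 and K = 2; each topic column sums to 1, and the topic proportions sum to 1):

    import numpy as np

    # A document as an approximate linear combination of topics: d ≈ W @ theta
    V, K = 5, 2
    W = np.array([[0.4, 0.0],            # topic matrix: column k is topic k,
                  [0.3, 0.1],            # entries are word-topic association
                  [0.2, 0.1],            # strengths; each column sums to 1
                  [0.1, 0.3],
                  [0.0, 0.5]])
    theta = np.array([0.7, 0.3])         # topic proportions for one document
    d_approx = W @ theta                 # reconstructed bag-of-words vector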

Given the unlabeled documents {d_(i)}_(i=1)^(N), the topics in these documents are learned by solving the following problem:

$\begin{matrix} {\min\limits_{{\{\theta_{i}\}}_{i = 1}^{N},\,W}\;\;\frac{1}{2}\sum\limits_{i = 1}^{N}\left\| {d_{i} - W\theta_{i}} \right\|_{2}^{2}} & ({\mathrm{Eq.}\;3}) \\ {\mathrm{s.t.}\;\;{\forall k = 1,\ldots,K},\;{v = 1,\ldots,V}:\;{W_{vk} \geq 0},\;{\sum\limits_{v = 1}^{V}W_{vk} = 1};} \\ {{\forall i = 1,\ldots,N},\;{k = 1,\ldots,K}:\;{\theta_{ik} \geq 0},\;{\sum\limits_{k = 1}^{K}\theta_{ik} = 1},} \end{matrix}$

where {θ_(i)}_(i=1) ^(N) denotes all the linear coefficients.

This problem can be solved by alternating between {θ_(i)}_(i=1)^(N) and W. The N sub-problems defined on {θ_(i)}_(i=1)^(N) can be solved independently by each machine based on the data shard, e.g., clinical documents, that it holds. The sub-problem defined on W is solved using the large-scale distributed machine learning system disclosed herein. Each machine in the system maintains a local copy, or replica, of the parameter state W, i.e., the topic matrix, and the various copies among the machines are synchronized to ensure convergence. The projected stochastic gradient descent algorithm is applied, which iteratively performs two steps: (1) stochastic gradient descent, and (2) projection onto the probability simplex.

In the first step, the stochastic gradient matrix computed over one document can be written as the outer product of two vectors: (Wθ_(i)−d_(i))θ_(i)^(T). In other words, this problem has the sufficient factor property and fits into the sufficient factor broadcasting framework described above. In each iteration, on the transmitter side, each machine computes a sufficient factor group and sends the sufficient factor groups to the other machines in the system. On the receiver side, each machine converts the sufficient factor groups it receives into gradient matrices, which are applied to the receiver's local copy of the state of the topic matrix to update it. A projection operation follows the gradient descent update.
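
A minimal sketch of the receiver-side step, assuming NumPy and using one standard algorithm for Euclidean projection onto the probability simplex (Duchi et al., 2008), applied column-wise to W; the sufficient factor group here is (u_i, θ_i) with u_i = Wθ_i − d_i:

    import numpy as np

    def project_to_simplex(w):
        """Euclidean projection of a vector onto the probability simplex."""
        u = np.sort(w)[::-1]                     # sort in descending order
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u - css / np.arange(1, len(w) + 1) > 0)[0][-1]
        return np.maximum(w - css[rho] / (rho + 1.0), 0.0)

    # Convert one sufficient factor group into an update matrix via the outer
    # product, apply it, then project each topic column back onto the simplex.
    def apply_sfg(W, u, theta, lr=0.05):
        W = W - lr * np.outer(u, theta)          # gradient descent step
        for k in range(W.shape[1]):
            W[:, k] = project_to_simplex(W[:, k])
        return W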

FIG. 8 is a flowchart of a method of creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents. The method may be performed, for example, by a system of processors 102, 104, 106 shown in FIG. 1 and configured in accordance with the machine learning and sufficient factor models and features described above, including those shown in FIGS. 2-7.

At block 802, a first processor 102 included in a machine learning system 100 comprising a plurality of machine learning processors 102, 104, 106 determines at least one local sufficient factor group for one or more documents included in the plurality of clinical documents. The local sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure of association between words in the document and a medical topic.

At block 804, the first processor 102 sends the at least one local sufficient factor group to one or more second processors 104, 106 in the system. In one embodiment, a plurality of local sufficient factor groups are determined for a corresponding plurality of documents. In this case, the first processor 102 selects and sends a subset of the plurality of local sufficient factor groups to the one or more other processors. In another embodiment, the first processor 102 randomly selects, from among a plurality of second processors in the system, the one or more second processors 104, 106 to which to send the local sufficient factor group.

At block 806, the first processor 102 receives at least one remote sufficient factor group from a second processor 104, 106 in the system. The at least one remote sufficient factor group is determined by the second processor for another document included in the plurality of clinical documents. The remote sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure of association between words in the other document and a medical topic.

At block 808, the first processor 102 processes the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix. In one embodiment, the first processor 102 processes the sufficient factor groups by converting each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix, and applying each update matrix to the topic matrix using a projection operation. Each of the local sufficient factor group and the remote sufficient factor group is converted into a corresponding update matrix by obtaining an outer product of the sufficient factors that respectively define the local sufficient factor group and the remote sufficient factor group.

The method of FIG. 8 may be performed by each processor 102, 104, 106 in the system. In this case, each processor shares its sufficient factor group or groups with the other processors in the system, and each processor is able to process its own sufficient factors together with those it receives, to thereby obtain and maintain a common, shared topic matrix.

As noted above, the topic matrix represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents. In an example practical application of the topic matrix, a medical topic search query may be input to one of the processors in the system 100 through a user interface, and the topic matrix may be accessed to obtain a listing of the documents most relevant to the medical topic.

FIG. 9 is a schematic block diagram of an apparatus 900. The apparatus 900 may correspond to one or more processors or machines of the medical topic discovery system of FIG. 1 configured to enable the method of FIG. 8. The apparatus 900 may be embodied in any number of processor-driven devices, including, but not limited to, a server computer, a personal computer, one or more networked computing devices, an application-specific circuit, a minicomputer, a microcontroller, and/or any other processor-based device and/or combination of devices.

The apparatus 900 may include one or more processing units 902 configured to access and execute computer-executable instructions stored in at least one memory 904. The processing unit 902 may be implemented as appropriate in hardware, software, firmware, or combinations thereof. Software or firmware implementations of the processing unit 902 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described herein. The processing unit 902 may include, without limitation, a central processing unit (CPU), a digital signal processor (DSP), a reduced instruction set computer (RISC) processor, a complex instruction set computer (CISC) processor, a microprocessor, a microcontroller, a field programmable gate array (FPGA), a System-on-a-Chip (SoC), or any combination thereof. The apparatus 900 may also include a chipset (not shown) for controlling communications between the processing unit 902 and one or more of the other components of the apparatus 900. The processing unit 902 may also include one or more application-specific integrated circuits (ASICs) or application-specific standard products (ASSPs) for handling specific data processing functions or tasks.

The memory 904 may include, but is not limited to, random access memory (RAM), flash RAM, magnetic media storage, optical media storage, and so forth. The memory 904 may include volatile memory configured to store information when supplied with power and/or non-volatile memory configured to store information even when not supplied with power. The memory 904 may store various program modules, application programs, and so forth, that may include computer-executable instructions that upon execution by the processing unit 902 may cause various operations to be performed. The memory 904 may further store a variety of data manipulated and/or generated during execution of computer-executable instructions by the processing unit 902.

The apparatus 900 may further include one or more interfaces 906 that may facilitate communication between the apparatus and one or more other apparatuses in the system 100 or an apparatus outside the system. For example, the interface 906 may be configured to transmit/receive sufficient factors to/from other processors or machines in the medical topic discovery system 100. The interface 906 may also be configured to receive one or more of clinical documents or vector representations of such documents from a corpus of clinical documents 112 stored in a database.

Communication may be implemented using any suitable communications standard. For example, a LAN interface may implement protocols and/or algorithms that comply with various communication standards of the Institute of Electrical and Electronics Engineers (IEEE), such as IEEE 802.11, while a cellular network interface may implement protocols and/or algorithms that comply with various communication standards of the Third Generation Partnership Project (3GPP) and 3GPP2, such as 3G and 4G (Long Term Evolution), and of the Next Generation Mobile Networks (NGMN) Alliance, such as 5G.

For example, the memory 904 may include an operating system module (O/S) 908 that may be configured to manage hardware resources such as the interface 906 and provide various services to applications executing on the apparatus 900.

The memory 904 stores additional program modules such as: (1) an interpreter module 910 that automatically identifies the symbolic expressions of sufficient factors and parameter updates; (2) a sufficient factor generator module 912 that selects training examples from a local data shard and computes a sufficient factor group for each example using the symbolic expressions of sufficient factors produced by the interpreter module 910; (3) a sufficient factor selector module 914 that chooses a small subset of the most representative sufficient factors out of those computed by the SF generator module 912 for communication; (4) a communication manager module 916 that transfers the sufficient factors chosen by the SF selector module 914 using broadcast or random multicast and receives remote sufficient factors; (5) an update generator module 918 which computes update matrices from locally-generated and remotely-received sufficient factors and updates the topic matrix; and (6) a central coordinator module 920 for periodic centralized synchronization, parameter-replica rotation, and elasticity. Each of these modules includes computer-executable instructions that, when executed by the processing unit 902, cause various operations to be performed, such as the operations described above.

The apparatus 900 and the modules disclosed herein may be implemented in hardware or in software that is executed on a hardware platform. The hardware or hardware platform may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic component, discrete gate or transistor logic, discrete hardware components, or any combination thereof, or any other suitable component designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing components, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, or any other such configuration.

Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk (e.g., compact disk (CD), digital versatile disk (DVD)), a smart card, a flash memory device (e.g., card, stick, key drive), random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a general register, or any other suitable non-transitory medium for storing software.

While various embodiments have been described above, they have been presented by way of example only, and not by way of limitation. Likewise, the various diagrams may depict an example architectural or other configuration for the disclosure, which is done to aid in understanding the features and functionality that can be included in the disclosure. The disclosure is not restricted to the illustrated example architectures or configurations, but can be implemented using a variety of alternative architectures and configurations.

In this document, the terms “module” and “engine” refer to software, firmware, hardware, and any combination of these elements for performing the associated functions described herein. Additionally, for purposes of discussion, the various modules are described as discrete modules; however, as would be apparent to one of ordinary skill in the art, two or more modules may be combined to form a single module that performs the associated functions according to embodiments of the invention.

In this document, the terms “computer program product”, “computer-readable medium”, and the like may be used generally to refer to media such as memory storage devices or storage units. These, and other forms of computer-readable media, may be involved in storing one or more instructions for use by a processor to cause the processor to perform specified operations. Such instructions, generally referred to as “computer program code” (which may be grouped in the form of computer programs or other groupings), when executed, enable the computing system to perform the specified operations.

Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. As examples of the foregoing: the term “including” should be read as meaning “including, without limitation” or the like; the term “example” is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof; and adjectives such as “conventional,” “traditional,” “normal,” “standard,” “known,” and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time. Instead, these terms should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future.

Additionally, memory or other storage, as well as communication components, may be employed in embodiments of the invention. It will be appreciated that, for clarity purposes, the above description has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processing logic elements, or domains may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processing logic elements or controllers may be performed by the same processing logic element or controller. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.

Furthermore, although individually listed, a plurality of means, elements, or method steps may be implemented by, for example, a single unit or processing logic element. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined. The inclusion of features in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category; rather, the feature may be equally applicable to other claim categories, as appropriate.

The various aspects of this disclosure are provided to enable one of ordinary skill in the art to practice the present invention. Various modifications to the exemplary embodiments presented throughout this disclosure will be readily apparent to those skilled in the art. Thus, the claims are not intended to be limited to the various aspects of this disclosure, but are to be accorded the full scope consistent with the language of the claims. All structural and functional equivalents to the various components of the exemplary embodiments described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.”

What is claimed is:
 1. A machine learning system for creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents, the system comprising a plurality of machine learning processors, each processor configured to: determine at least one local sufficient factor group for one or more documents included in the plurality of clinical documents; send the at least one local sufficient factor group to one or more other processors in the system; receive at least one remote sufficient factor group from another processor in the system, the at least one remote sufficient factor group being determined by the other processor for another document included in the plurality of clinical documents; and process the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix.
 2. The system of claim 1, wherein the local sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure between words in the document and a medical topic.
 3. The system of claim 1, wherein the remote sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure between words in the other document and a medical topic.
 4. The system of claim 1, wherein: the processor determines a plurality of local sufficient factor groups for a corresponding plurality of documents, and the processor is further configured to select and send a subset of the plurality of local sufficient factor groups to the one or more other processors in the system.
 5. The system of claim 1, wherein the processor sends the at least one local sufficient factor group by being further configured to randomly select, from among a plurality of other processors in the system, the one or more other processors to which to send the local sufficient factor group.
 6. The system of claim 5, wherein sufficient factor groups are randomly selected based on joint matrix column subset selection.
 7. The system of claim 1, wherein the processor processes the local sufficient factor group together with the remote sufficient factor group by being further configured to: convert each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix; and apply each update matrix to the topic matrix using a projection operation.
 8. The system of claim 7, wherein the processor converts each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix by being further configured to obtain an outer product of the sufficient factors that respectively define the local sufficient factor group and the remote sufficient factor group.
 9. A method of creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents, the method comprising: determining, at a first processor included in a machine learning system comprising a plurality of machine learning processors, at least one local sufficient factor group for one or more documents included in the plurality of clinical documents; sending, from the first processor, the at least one local sufficient factor group to one or more second processors in the system; receiving, at the first processor, at least one remote sufficient factor group from a second processor in the system, the at least one remote sufficient factor group being determined by the second processor for another document included in the plurality of clinical documents; and processing, at the first processor, the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix.
 10. The method of claim 9, wherein the local sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure between words in the document and a medical topic.
 11. The method of claim 9, wherein the remote sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure between words in the other document and a medical topic.
 12. The method of claim 9, wherein a plurality of local sufficient factor groups are determined for a corresponding plurality of documents, and further comprising: selecting, at the first processor, and sending, from the first processor, a subset of the plurality of local sufficient factor groups to the one or more other processors in the system.
 13. The method of claim 9, wherein sending the at least one local sufficient factor group comprises: randomly selecting, from among a plurality of second processors in the system, the second processor to which to send the local sufficient factor group.
 14. The method of claim 13, wherein sufficient factor groups are randomly selected based on joint matrix column subset selection.
 15. The method of claim 9, wherein processing the local sufficient factor group together with the remote sufficient factor group comprises: converting each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix; and applying each update matrix to the topic matrix using a projection operation.
 16. The method of claim 15, wherein converting each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix comprises obtaining an outer product of the sufficient factors that respectively define the local sufficient factor group and the remote sufficient factor group.